Combining Feature Selection and Ensemble Learning for Software Quality Estimation
Authors

Abstract
High dimensionality is a major problem that affects the quality of training datasets and therefore classification models. Feature selection is frequently used to deal with this problem. The goal of feature selection is to choose the most relevant and important attributes from the raw dataset. Another major challenge to building effective classification models from binary datasets is class imbalance, where the minority class has far fewer instances than the majority class. Data sampling (altering the dataset to change its balance level) and boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models) are common techniques for resolving this problem. In particular, ensemble boosting, which integrates sampling with AdaBoost, has been shown to improve classification performance, especially for imbalanced training datasets. In this paper, we investigate approaches for combining feature selection with this ensemble learning (boosting) process. Six feature selection techniques and two forms of the ensemble learning method are examined. We focus on two different scenarios: feature selection performed prior to the ensemble learning process and feature selection performed inside the ensemble learning process. The experimental results demonstrate that performing feature selection inside the ensemble boosting process generally yields better results than performing it prior to ensemble boosting.
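The two scenarios can be sketched in code. The following is a minimal illustration rather than the paper's implementation: it assumes scikit-learn decision stumps as the weak learners, random undersampling as the data-sampling step, and a single ANOVA-based ranker (SelectKBest with f_classif) standing in for the six feature selection techniques studied; the function and parameter names (boost_with_fs, undersample, fs_inside, k) are hypothetical. With fs_inside=False, features are selected once before boosting; with fs_inside=True, they are re-selected from the sampled data at every boosting round.

```python
# Sketch of feature selection before vs. inside sampled ensemble boosting.
# Assumptions: binary labels, random undersampling, decision stumps, SelectKBest ranking.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, w, rng):
    """Randomly discard majority-class instances until the classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y != minority)[0]
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx], w[idx]


def boost_with_fs(X, y, k, rounds=10, fs_inside=True, seed=0):
    """AdaBoost-style loop with undersampling each round; feature selection is
    applied once up front (fs_inside=False) or per boosting round (fs_inside=True)."""
    rng = np.random.default_rng(seed)
    y = np.where(y == np.unique(y)[0], -1, 1)            # encode labels as {-1, +1}
    if not fs_inside:                                     # scenario 1: FS before boosting
        global_sel = SelectKBest(f_classif, k=k).fit(X, y)
        X = global_sel.transform(X)
    w = np.full(len(y), 1.0 / len(y))                     # instance weights
    models = []
    for _ in range(rounds):
        Xs, ys, ws = undersample(X, y, w, rng)            # sampling inside boosting
        sel = SelectKBest(f_classif, k=k).fit(Xs, ys) if fs_inside else None
        Xt = sel.transform(Xs) if sel else Xs             # scenario 2: FS inside boosting
        stump = DecisionTreeClassifier(max_depth=1).fit(Xt, ys, sample_weight=ws)
        pred = stump.predict(sel.transform(X) if sel else X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)             # weak learner's vote weight
        w *= np.exp(-alpha * y * pred)                    # upweight misclassified instances
        w /= w.sum()
        models.append((alpha, sel, stump))

    def predict(X_new):
        if not fs_inside:
            X_new = global_sel.transform(X_new)
        score = sum(a * m.predict(s.transform(X_new) if s else X_new)
                    for a, s, m in models)
        return np.sign(score)

    return predict
```

For example, boost_with_fs(X_train, y_train, k=6, fs_inside=True) trains the "feature selection inside boosting" variant; storing a selector alongside each weak learner is what allows the per-round feature subsets to be reapplied at prediction time.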
Similar Articles
Comparing Two Approaches for Adding Feature Ranking to Sampled Ensemble Learning for Software Quality Estimation
High dimensionality and class imbalance are two main problems that affect the quality of training datasets in software defect prediction, resulting in inefficient classification models. Feature selection and data sampling are often used to overcome these problems. Feature selection is a process of choosing the most important attributes from the original data set. Data sampling alters the data s...
Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques
Software project management is one of the most significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE has been practiced in the computer industry since the 1940s and has been revisited several times. An SDEE model is appropriate if it provides accuracy and confidence simultaneously before softwa...
Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection
Due to the rise of technology, the possibility of fraud in areas such as banking has increased. Credit card fraud is a crucial problem in banking, and its danger is ever increasing. This paper proposes an advanced data mining method that considers both feature selection and decision cost to improve the accuracy of credit card fraud detection. After selecting the best and most effec...
Combining Classifier Guided by Semi-Supervision
This article proposes an algorithm for a regular classifier ensemble methodology. The proposed methodology is based on possibilistic aggregation to classify samples. The proposed method optimizes an objective function that combines an environment-recognition term, a multi-criteria aggregation term, and a learning term. The optimization aims at learning backgrounds as solid clusters in subspaces of the high...